ABSTRACT

An investigation was made of the effect of an item's position within a test
on its difficulty value. Using a 60-item operational test comprised of 5
subtests, 60 items were placed as experimental items on a number of spiralled
test forms in three different positions (first, middle, last) within the subtest
composed of like items. Item data used resulted from Rasch one-parameter item
response calibrations. Variations among the mean Rasch difficulties lay well
within one standard deviation. Except for a few outliers, the item difficulty
values graph within the 95 percent confidence limits for evaluating overall
stability of the estimates. Thus, the consistency of these estimates supports
the notion that Rasch item parameters are not importantly affected by the
position of an item. (Author)
The Effect of the Position of an Item
Within a Test on the Item Difficulty Value



A deficiency present in classical test theory approaches is that both item and test
characteristics are dependent on the specific attributes of the examinees on which the
statistics are gathered. For example, characteristics of items such as difficulty and
discrimination vary across groups of examinees with different distributions of ability.
Test indices relating to reliability and validity also are affected by the abilities of the
examinees taking the test. In contrast, one of the most important attributes of item
response theory is the supposed invariance of item parameters across groups (Lord,
1980). That is, the characteristics of each item can be described by one set of values.
This quality should allow test developers to gather item statistics on one occasion and
use the information subsequently to compile tests having predetermined characteristics.

In item response theory a major assumption made is that the difficulty of
individual items is not altered by the test context in which the items are placed. In the
classical test theory approach, the validity of this assumption is not crucial since data
are collected on the test as a single entity (Whitely & Dawis, 1976). However, since
item response theory requires collection of data on items, context effects occurring as
the general result of the sequencing of items or as the result of specific characteristics
of the other items in the test could have important influences. In practice, statistics
are gathered on items either by means of special field test procedures or by placing
experimental items on tests administered operationally to examinees. After test form
specific statistics are calculated for these new items, the items are linked to the
common scale of an item pool or item bank to await use on future tests. Since items
are chosen for new tests on the basis of the statistics previously gathered, the accuracy
of these statistics is very important to the integrity of any future test. "These item
parameter values also influence the trait values that are obtained subsequently for any
given examinee's pass/fail responses to the items, and the parameter values influence
the standard error of the trait value provided by the latent trait model" (Yen, 1980, p.
297).

Whitely and Dawis (1976) investigated context effects on classical (p-value) and
Rasch (one-parameter logistic model) item difficulties by using a verbal analogies test.
A core of fifteen items was placed on seven different tests. Each test consisted of
sixty items: the fifteen core and forty-five unique items. The tests, administered in
sixty-minute sessions, were distributed randomly in each of seven sessions. Of the
fifteen items, six had statistically significant differences in Rasch difficulties and nine
had statistically significant differences in classical difficulties across the seven tests.
Yen (1980) compared differences in context effects for mathematics and reading items
in both the three- and one-parameter logistic models. It was found that item
parameters estimated from the same context are more highly related than item
parameters estimated from different contexts. Also, context effects appeared for both
the three-parameter and the Rasch item difficulties, weaker for the three-parameter
than for the Rasch on the reading items, and the reverse on the mathematics items. In
addition, although context effects were found to influence the shape of the obtained
item and test characteristic curves, these influences were less for the Rasch model
than for the three-parameter model.

As Kingston and Dorans (1982) point out, there are only two alternatives when
considering the use of precalibrated items on a test. Either the item must be placed in
the same position on the new test as on the test used for item parameter calibration, or
the position of an item must make no appreciable difference on the item difficulty.
Because the first alternative is usually not feasible on account of the administrative
complexities involved, a systematic investigation was made of the effect on obtained
item difficulties when the item's position varied. Experimental items were placed at
the beginning, middle, and end of subtests composed of like items.

METHOD 

The data for this study came from the March 1983 administration of the Virginia
Minimum Competency Reading Test given to approximately 80,000 students. The Rasch
one-parameter logistic model had been chosen as the basis for test development and
longitudinal equating for this program. Thus, items used are selected to fit this model.
The regular editions of the reading test are comprised of sixty operational items divided
into five competencies or subtests (twenty items in the first competency and ten items
on each of the other four competencies). The test is similar in format and content to
the IOX Basic Skill Tests: Secondary Level, Reading (IOX, 1978). When experimental
items are placed on the forms, the total number of items per form is usually raised to
eighty; however, for this investigation the total number of items per form was
eighty-four. The test is administered with no time limit. For this study, a total of sixty
experimental items were placed on different forms in each of three positions (first,
middle, last) within their respective competency.
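
As a minimal sketch of the Rasch one-parameter logistic model named above, the probability of a correct response depends only on the difference between the examinee's ability and the item's difficulty, both expressed in logits. The function name `rasch_probability` and the example values are illustrative assumptions; they are not taken from the study or from its calibration software.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model (logit units)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Illustrative values only: when ability equals difficulty, the examinee
# has an even chance of answering the item correctly.
print(rasch_probability(0.0, 0.0))  # 0.5

# An examinee two logits above the item's difficulty is much more likely
# to answer correctly.
print(round(rasch_probability(2.0, 0.0), 3))
```

Because only the difference between ability and difficulty enters the model, shifting both by the same constant leaves the probability unchanged, which is what makes a common scale for an item bank possible.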

Eighteen forms of the test, containing the same operational items but different
experimental items, were administered in a spiralled fashion (i.e., packaging the forms
in sequential order, with packages beginning with as many different form numbers as
forms being administered). This type of administration resulted in randomly parallel
groups taking each form. After all student answer sheets were scanned, a random
sample of 10,000 students was drawn for the purpose of calibrating the items using the
BICAL III computer program (Wright, Mead, & Bell, 1979). Thus, the items in each of
the forms were calibrated with a sample of approximately 550 students. When the item
difficulties in a form are calibrated by BICAL III, the mean of the item difficulties is
set to zero. Item difficulty calibrations are anchored relative to the other items in the
test form. The sixty operational items of the March administration constituted the
core of items used to link the eighteen forms together and to the common scale of the
existing item bank (Wright & Stone, 1979). No experimental items were used in the
linking process. The item difficulty parameters reported are those adjusted to the scale
of the existing bank. The p-values are the actual obtained values.
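
The linking step can be illustrated with a simple mean-shift sketch. This is an assumption about one common way to link through shared items; the text does not give the exact computation used. The function `link_to_bank`, the item labels, and all numbers below are hypothetical.

```python
from statistics import mean

def link_to_bank(form_difficulties, bank_difficulties, core_items):
    """Shift form-calibrated difficulties onto the bank scale.

    The shift is the mean difference, over the core (linking) items,
    between the bank difficulties and the form difficulties.
    """
    shift = mean(bank_difficulties[i] - form_difficulties[i] for i in core_items)
    return {item: d + shift for item, d in form_difficulties.items()}

# Hypothetical numbers for illustration; not data from the study.
form = {"op1": -0.50, "op2": 0.10, "exp1": 0.40}  # form calibration, mean zero
bank = {"op1": 0.30, "op2": 0.90}                 # existing bank values
linked = link_to_bank(form, bank, core_items=["op1", "op2"])
print({item: round(d, 2) for item, d in linked.items()})
```

Only the operational (core) items determine the shift here, mirroring the statement that no experimental items were used in the linking process; the experimental item is simply carried along onto the bank scale.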

RESULTS 

The means and standard deviations of all the item difficulty estimates in each of
the three positions (first, middle, last) are presented in Table 1. The greatest difference
in mean difficulty values is between the first and middle positions and that is .144.
Between first and last position the difference in means is .049 and between middle and
last the difference is .095. The means of the difficulty estimates for the items within
each subtest are displayed graphically in Figure 1. The greatest variation in means is in
the fifth subtest and the least is in the fourth subtest.
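
The pairwise differences follow directly from the means reported in Table 1, as this short check shows. The variable names are illustrative.

```python
from itertools import combinations

# Mean Rasch difficulties (logits) by position, as reported in Table 1.
means = {"first": 0.936, "middle": 0.792, "last": 0.887}

# Absolute difference for every pair of positions.
diffs = {(a, b): abs(means[a] - means[b]) for a, b in combinations(means, 2)}
for (a, b), diff in diffs.items():
    print(f"{a} vs {b}: {diff:.3f}")
```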

In Table 2 the means and the standard deviations of the p-values of all the items
in their respective positions are presented. These mean p-values differed by .003 to
.012, with the greatest difference between the first and middle positions. The mean
p-values for the items within each of the subtests are shown in the graph in Figure 2.
The mean ability estimates for each of the forms wherein the items under
discussion were placed are presented in Table 3. These range from a high of 2.99 to a
low of 2.71. Table 3 also contains the person separability indices (PSI) for each
experimental form. This index, calculated during the BICAL III calibrations of the item
difficulty estimates, is similar to the index of subject separability (ISS) described by
Gustafsson (1977).

Figures 3-5 display graphs of the difficulty estimates of each item in one position
plotted against the difficulty estimate of the same item in another position. The
correlation coefficients relating to the graphs are also presented. All three
correlations are [value illegible] or higher.

A one-way analysis of variance indicated that there was no significant difference
between the means of the Rasch difficulty estimates of the items placed in each of the
three positions, F(2, 118) = 2.57, p > .05.

DISCUSSION

Some variation can be seen among the mean difficulty value estimates for the
items in the different positions (first, middle, last) within their respective subtests;
however, the differences between these means lay well within one standard deviation.
The mean p-values show less variation among the different positions.

Since the examinees taking each form are randomly assigned from the same
population, the ability estimates were expected to be similar and they were. All forms
contained at least two experimental items other than those used for this study, so the
effect of the different positions of the items on the ability estimates cannot be
determined conclusively. However, the information presented shows that the ability
estimates of the examinees taking each form are very similar.

The person separability index (PSI) is almost identical for all forms. This index
serves as a counterpart to the coefficient of reliability when direct estimates of the
variance of the errors of measurement can be obtained (Gustafsson, 1977). The PSI is
sample specific and is lower when the ability level of the examinees is not measured
precisely. This seems to be the case in this study. The ability estimates on the forms
vary from 2.99 to 2.71, when the items have a mean difficulty value of zero. Because
the data were derived from an administration of a high school minimum competency
test, it might be expected that the mean of the difficulty estimates of the items would
be much lower than the mean of the examinee ability. In such circumstances, values of
the PSI are expected to be in the 0.80 to 0.85 range because the test is not precise
(Wright & Stone, 1979).
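
A separability index of this kind is commonly computed as the share of observed ability variance that is not error variance. The sketch below assumes that formulation; the exact computation inside BICAL III is not given in the text, and the function name and data are hypothetical.

```python
from statistics import mean, pvariance

def person_separability(abilities, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    observed_var = pvariance(abilities)
    error_var = mean(se ** 2 for se in standard_errors)
    return (observed_var - error_var) / observed_var

# Hypothetical ability estimates (logits) and their standard errors.
abilities = [1.0, 2.0, 3.0, 4.0, 5.0]
standard_errors = [0.5, 0.5, 0.5, 0.5, 0.5]
print(person_separability(abilities, standard_errors))  # 0.875
```

Under this formulation, larger standard errors of the ability estimates lower the index, which agrees with the observation that the index falls when examinee ability is not measured precisely.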

Viewing the graphs of the comparative item difficulty values, the invariance
property of the Rasch model becomes evident. Except for a few outliers, all points lie
within the 95% confidence limits for evaluating the overall stability of the difficulty
estimates for the same item as described by Wright and Stone (1979). The statistic
suggested by Wright and Stone is $t_{12} = (d_1 - d_2) / (s_1^2 + s_2^2)^{1/2}$, which
has an approximate normal distribution having a mean of 0 and a standard deviation
of 1. $(s_1^2 + s_2^2)^{1/2}$ is an estimate of the expected standard error of the
difference between two difficulty estimates $d_1$ and $d_2$, independently calibrated,
for one parameter $\delta$. The reasons for the items producing inconsistent difficulty
parameters are not apparent. No item appears as an outlier on all three graphs.
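
The stability check described above can be sketched as follows: the difference between two independent difficulty estimates of the same item is divided by the standard error of that difference, and the result is compared with the 95% normal limits. The function names, the 1.96 cutoff argument, and the example values are illustrative assumptions, not figures from the study.

```python
import math

def standardized_difference(d1, s1, d2, s2):
    """Difference of two estimates divided by the standard error of the difference."""
    return (d1 - d2) / math.sqrt(s1 ** 2 + s2 ** 2)

def is_stable(d1, s1, d2, s2, z_crit=1.96):
    """True when the two estimates agree within the 95% confidence limits."""
    return abs(standardized_difference(d1, s1, d2, s2)) <= z_crit

# Hypothetical difficulty estimates (logits) and standard errors.
print(is_stable(0.90, 0.12, 1.05, 0.12))  # small difference between positions
print(is_stable(0.90, 0.12, 1.60, 0.12))  # large difference between positions
```

An item flagged as unstable in such a check corresponds to a point lying outside the confidence bands on the position-by-position graphs.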



The consistency of the difficulty estimates of the items placed in different
positions seems to support the notion that Rasch item parameters are not importantly
affected by the position of an item. Context effects such as those produced by the
individual characteristics of adjacent items, as opposed to general position effects, may
play a part in causing the few items to be outliers. However, this study concentrated
only on the general position effects. Other investigations are planned to look at
specific context effects within the subtests.
Table 1

Item Difficulty Values

Position     Mean      S.D.
First        .936     1.200
Middle       .792     1.313
Last         .887     1.205



Table 2

p-values

Position     Mean      S.D.
First        .833      .103
Middle       .845      .103
Last         .835      .101
Figure 1. Mean Item Difficulty Estimates (Logits)
Table 3

Form Statistics

Ability Estimates (Logits)     Person Separability
Mean      S.D.                 Indices
2.73      1.14                 .84
2.75      1.20                 .85
2.89      1.22                 .83
2.74      1.19                 .85
2.91      1.15                 .82
2.71      1.19                 .83
2.89      1.12                 .82
2.99      1.20                 .83
2.75      1.12                 .84
2.94      1.10                 .81
Figure 3. Item Difficulty Estimates (Logits)
First and Middle Positions

Figure 4. Item Difficulty Estimates (Logits)
First and Last Positions

Figure 5. Item Difficulty Estimates (Logits)